online supplement
Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics
Dahlem, Dominik, Maniloff, Diego, Misiura, Mac
Large language models hallucinate in predictable ways: attention routing fails by over-concentrating on a narrow set of positions, or by spreading so diffusely that relevance is diluted, and the shape of the failure carries diagnostic signal. A widely used family of spectral methods analyzes the symmetric component of the degree-normalized attention operator, which governs transport capacity; we prove that every transpose-invariant spectral diagnostic of this operator is structurally orientation-blind (it cannot distinguish an operator from its transpose, and therefore cannot detect information-flow direction), with a quantitative converse establishing the asymmetry coefficient $G$ as the unique control parameter for direction. Pairing this with a closed-form bipartite-Cheeger landscape for canonical causal architectures, we show that uniform causal attention satisfies an $n$-independent floor $ϕ\ge 1/5$ with worst cut at $t^\ast/n \approx 0.32$, while window attention pierces the floor as $O(w/n)$; failure modes are shape-different, not just value-different. The resulting two-axis diagnostic ($ϕ$ for capacity, $G$ for direction) yields a falsifiable polarity prediction: bottleneck- and diffuse-dominated benchmarks should exhibit opposite polarity. Under length-controlled evaluation, transport features retain interpretable signal (LC-AUROC from 0.62 to 0.84) on tested models up to 8B parameters, with polarity reversing as predicted between HaluEval and MedHallu.
Personalized Machine Learning: Online Supplement
The book is currently available in draft form as a downloadable pdf. Every day we interact with machine learning systems that personalize their predictions to individual users, whether to recommend movies, find new friends or dating partners, or organize our news feeds. Such systems involve several modalities of data, ranging from sequences of clicks or purchases, to rich modalities involving text, images, or social interactions. While settings and data modalities vary significantly, in this book we introduce a common set of principles and methods that underpin the design of personalized predictive models. The book begins by revising "traditional" machine learning models, with a special focus on how they should be adapted to settings involving user data.
Automated extraction of mutual independence patterns using Bayesian comparison of partition models
Marrelec, Guillaume, Giron, Alain
Mutual independence is a key concept in statistics that characterizes the structural relationships between variables. Existing methods to investigate mutual independence rely on the definition of two competing models, one being nested into the other and used to generate a null distribution for a statistic of interest, usually under the asymptotic assumption of large sample size. As such, these methods have a very restricted scope of application. In the present manuscript, we propose to change the investigation of mutual independence from a hypothesis-driven task that can only be applied in very specific cases to a blind and automated search within patterns of mutual independence. To this end, we treat the issue as one of model comparison that we solve in a Bayesian framework. We show the relationship between such an approach and existing methods in the case of multivariate normal distributions as well as cross-classified multinomial distributions. We propose a general Markov chain Monte Carlo (MCMC) algorithm to numerically approximate the posterior distribution on the space of all patterns of mutual independence. The relevance of the method is demonstrated on synthetic data as well as two real datasets, showing the unique insight provided by this approach.
Detection of Block-Exchangeable Structure in Large-Scale Correlation Matrices
Perreault, Samuel, Duchesne, Thierry, Nešlehová, Johanna G.
Correlation matrices are omnipresent in multivariate data analysis. When the number $d$ of variables is large, the sample estimates of correlation matrices are typically noisy and conceal underlying dependence patterns. We consider the case when the variables can be grouped into $K$ clusters with exchangeable dependence; an assumption often made in applications in finance and econometrics. Under this partial exchangeability condition, the corresponding correlation matrix has a block structure and the number of unknown parameters is reduced from $d(d-1)/2$ to at most $K(K+1)/2$. We propose a robust algorithm based on Kendall's rank correlation to identify the clusters without assuming the knowledge of $K$ a priori or anything about the margins except continuity. The corresponding block-structured estimator performs considerably better than the sample Kendall rank correlation matrix when $K < d$. Even in the unstructured case $K = d$, though there is no gain asymptotically, the new estimator can be much more efficient in finite samples. When the data are elliptical, the results extend to linear correlation matrices and their inverses. The procedure is illustrated on financial stock returns.
Non-stationary Stochastic Optimization with Local Spatial and Temporal Changes
Chen, Xi, Wang, Yining, Wang, Yu-Xiang
We consider a non-stationary sequential stochastic optimization problem, in which the underlying cost functions change over time under a variation budget constraint. We propose an $L_{p,q}$-variation functional to quantify the change, which captures local spatial and temporal variations of the sequence of functions. Under the $L_{p,q}$-variation functional constraint, we derive both upper and matching lower regret bounds for smooth and strongly convex function sequences, which generalize previous results in (Besbes et al., 2015). Our results reveal some surprising phenomena under this general variation functional, such as the curse of dimensionality of the function domain. The key technical novelties in our analysis include an affinity lemma that characterizes the distance of the minimizers of two convex functions with bounded $L_p$ difference, and a cubic spline based construction that attains matching lower bounds.